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We propose and discuss a novel strategy for protein design. The method is based on recent 
theoretical advancements which showed the importance to treat carefully the conformational free 
energy of designed sequences. In this work we show how computational cost can be kept to a 
minimum by encompassing negative design features, i.e. isolating a small number of structures 
that compete significantly with the target one for being occupied at low temperature. The method 
is succesfuUy tested on minimalist protein models and using a variety of amino acid interaction 
potentials. 



I. INTRODUCTION 

Since the late 50's it has been established that the na- 
tive state of a protein is enJ-kely and uniquely encoded 
by its amino acids sequencetla. One of the fondamental 
issues in molecular biology is understanding the relation 
between protein sequence and native structure. Remark- 
ably, this relation is not symmetric: while a given se- 
quence folds into a single structure, a given structure can 
be encoded by several homologous sequences. The prob- 
lem of predicting the native state of a sequence is com- 
monly known as " protein folding" . Its solution amounts 
to minimize the energy of the given peptidic chain over all 
possible conformations. The design problem, i.e. finding 
the sequence(s) that fold into a desired structure, has 
also been given a simple and rigorous formulation. At 
the "physiological" temperature f3~^ , the sequences that 
correctly design a given target structure, Tt j^their native 
state), maximize the Boltzmann probability^, 

P{s,TuPp) = e^^[~ pp[H,{Tt) - F^iPp)]] , (1) 

where s = (si, S2, . . . , sl) represents the amino acid se- 
quence (amino acid si at the first position in F, . . . , s^ at 
the last position in F) and Hg is the energy of s housed 
on r. Fs in eq. (|^) is the conformational free energy of 
the sequence s. 



Fs{P) = -r' In { J2 exp[-/3ifs(r)]} 



(2) 



where the summation is taken over all possible conforma- 
tions the sequence s can assume without violating steric 
constraints. Maximizing (|l|) poses some serious technical 
difficulties since, in principle, it entails an exploration of 
both sequence and structure spaceH'Q. Some simplifica- 
tions have been used in the past in order to limit the 
space of sequences; this is conveniently done by subdi- 
viding -amino acids into a limited number, q < 20, of 
classesElQ. Some approximation schemes have-|aIao been 
used to reduce the difficulty of calculating .F'sQ'HIS. Rea- 
sonable success has been obtained, for example, by 
tulating a suitable functional dependence of Fs on sE3 
Recently, it has also been argued that, despite the huge 



number of conformations, F, the most significant conlxi- 
bution to (0) comes from the closest competitors of F(Eil. 
These are limited in number, since they are among the 
ones sharing a subset of native contacts with Fj. 

In this article we propose a new technique apt for 
designing a given structure. Ft, using a minimal set of 
structure for calculating (pi). The technique is simple to 
implement and does not require to constrain sequence 
composition and/or to search solutions with the lowest 
energytj. Several exact tests have been implemented in 
order to assess the performance of the new method with 
respect to previously proposed techniques. 



II. THEORY: THE ITERATIVE DESIGN 
SCHEME 

In order to discuss the design method we introduce 
a Hamiltonian function, Hs{T) depending only on coarse 
grained degrees of freedom. A commonly used form is the 
one in terms of the contact matrix A2 (ri,rj) which is 1 
when |ri— Tjl < a with a ~ 6—8A, and otherwise. Other 
forms which smoothly interpolate between and 1 are 
also used in practice. Two amino acids, st and Sj, which 
are in contact contribute to the energy by an amount 
£2(^1, sA, aphenomenological symmetric matrix (see e.g. 
refs. [ ^3| |lj, ^5], |l^ ) . Many body interations can also 
be easily included in terms of a generalized contact maps 
Afc(rii , . . . , Ti^) depending only on relative distances and 
on extra energy parameters efe(sij, . . . , Si^.). Thus the 
energy can be written as 

Hs(^) = Yl Yl efc(sii,---,SiJAfc(f;i,...,r=*J. 



k>2 ii<i2<-<ik 



(3) 



Two structures which have the same values for all the 
Afc's, i.e. the same generalised contact map, will be re- 
garded as identical. This useful coarse-graining proce- 
dure neglects the fine structure fluctuacions (e.g. due to 
thermal excitations) and, for a sequence s with native 
state F, allows to define its folding temperature, (3p^, 
such that 



P(s,r,/3)>l/2 



(4) 



for all (3 > Pf- Conversely, all s's satsfying inequality (Q) 
have their unique ground state in T and folding temper- 
ature greater than (3~^ . 

Based on this observation the novel strategy for protein 
design can be formulated in terms of a scheme similar 
in spirit the one described in rcf [ 16 1. The essence of 
the procedure relies on the fact that the sum in (g) is 
carried out only on a limited set of structures D. Initially, 
D contains only the target structure itself and another 
structure with a different contact map and similar degree 
of compactness (chosen at random or with other criteria) . 
Upon iterating the procedure several design solutions will 
be identified and stored in set S. This set is, of course, 
initially empty. The steps to be iterated are as follows, 

1. An optimization procedure, like simulated anneal- 
ing, is used to explore sequence space and isolate 
the sequence s (not already included in S), such 
that 



l3[H,{Tt)-Tsm <ln2 



(5) 



T is calculated approximately by restricting the 
sum in (0) over the competitive structures held in 
D: 



Tsifi) = -/5-Mn{^ exp[-/3i/,(r)]} 



(6) 



TeD 



2. Then the lowest energy state(s), F of s are identified 
and the corresponding energy compared with that 
obtained by s on Fj. By definition, if F 7^ Ft and 
H{s,T) < H{s,Tt), then s is not a solution to the 
design problem and F is added to D. Otherwise, s, 
is added to the set of known solutions, S. 

The iterative procedure is repeated from step 1. The 
scheme stops when it is impossible to find sequences, 
satisying (|5|) not already included in S, or when a suffi- 
cient number of solutions has been retrieved. It is easy to 
see, using (||) and (^) that, in step 2, it can never happen 
that a newly chosen F 7^ Fj is already contained in D. 
Thus, at each iteration, new informations are collected, 
either in the form of a putative solution (added to S) or 
as a new decoy (added to D). 

Notice that, if the exact form of Fg were used instead of 
(n) , then the sequences in S would have a folding temper- 
ature greater than /?~^. However, since approximation 
(^ leads to systematically overestimating Fs{f3), it is not 
guaranteed that the selected sequences have f3p^ < f3~^. 
The inequality should however be satisfied to a better 
extent for larger decoys sets. 

The method outlined here is rigorous and its iterative 
application allows, in principle, to extract all sequences 
designing a given structure. Its pratical implementation 
may encounter difficulties at step 2, where it is required 



to find the low energy conformation(s) of a sequence. Se- 
quences selected at step 2 with a low P will correspond- 
ingly have a high folding temperature and are expected 
to be good folders. Hence, it is plausible that identify- 
ing the corresponding low energy states is much simpler 
than solving the general folding problem. In fact, we 
have gathered numerical evidence showing that strategy 
can be stopped as soon as a one finds a structure where 
is attained an energy lower than on Fj (even if true na- 
tive state has still lower energy). Notice that it is still 
necessary to have folding technique to allow to test if the 
design procedure is successful. Our iterative scheme is 
able to use informations of failed attempts in order to 
improve design at subsequent iterations. 



III. METHODS: IMPLEMENTATION AND TEST 
OF THE PROCEDURE 

To carry out a rigorous and exhaustive test of the 
proposed strategy we have restricted the space of struc- 
tures by discretizing the positions of amino acids, r,. 
We choose to follow the common practice of n^xiclhis 
the Ti's to occupy the nodes of a cubic latticetHiZHia'alia. 
This semplification allows for an exhaustive search of 
the whole conformation space for chain of a few dozen 
residues, albeit at the expenses of an accurate represen- 
tation of protein structures, as discussed in ref. [ P0[ . 
To mimic the high degree of compactness found in natu- 
rally occurring proteins, we first considered all the max- 
imally compact self-avoiding walks of length L = 27 em- 
bedded in a 3 X 3 X 3 cube. There are 103346 distinct 
oriented walks modulo rotations and reflections. This 
restriction is a good approximation if the interaction en- 
ergies between amino acids are sufficiently negative, so 
that compact conformations are favoured over loose ones. 
Without loss of generality we adopt a Hamiltonian where 
only pairwise interactions are considered (corresponding 
to fc = 2 in (H)). If the interaction energies are suffi- 
ciently attractive it is guaranteed that the native state is 
compact. Step 2 of the iterative procedure was carried 
out in two distinct ways. In a first attempt we found the 
true lowest energy state of s by exhaustive search. In a 
second attempt we tried to mimic the difficulty of finding 
the ground state in a realistic context and hence carried 
out a random partial exploration of the structure space. 
Although the first method was expected to be more ef- 
ficient than the second, their performance turned out to 
be almost identical, as we discuss below. 

The four target conformations used to test the proce- 
dure are given in Table | and shown in Fig ^-d. We 
used three possible choices for the e's. First, we adopted 
the standard 2-class HP model with chh = — 1 — a and 
ejjp = epp = —a. a is a suitable constant ensuring that 
native conformations are compact. Since all conforma- 
tions considered here have the same number of contacts 
the value of a is irrelevant and will be omitted from now 
on. The second case is a 6-class model and the e's are 



shown in Table 0. For the last case we considered the 
full repertoire of 20 amino acids used the Miyazawa and 
Jernigan energy parameters given in Table 3 of ref. [ |3| . 
With the standard HP parameters, structures Fi — F4 
have various degree of designability. The latter is defined 
as the numbec, of sequences admitting them as unique 
ground statea22l . Hence, the encodability of Fi and F2 is 
poor and average respectively, while F3 and F4 have very 
large encodability. It was shown that the degree of encod- 
ability is mainly a geometrical property of the structure 
and not too sensitive to the number of aiHJpa-acid classes 
or the values of interaction parameter£2l't3'll3. For this 
reason we expect that the relative encodability of Fi — F4 
remain different when using all the three sets of param- 
eters. 



IV. RESULTS AND DISCUSSION 

The "dynamical" performance of the algorithm can be 
seen in Figs. ga-c. The plots show the number of solu- 
tions retrieved as a function of the number of iterations 
at a "physiological" temperature equal to 0.1, 10.0 and 
0.7 for 2, 6 and 20 classes of amino acids, respectively. 
The different values of the physiological temperature are 
related to the different energy scales of the interactions. 

It can be seen that, after an initial transient, the per- 
formance of the method (given by the slope of the curves) 
is very high. In particular, for a large number of classes, 
it is nearly equal to 1 for all structures. 



{H, 



Hs{V) 



Table pro- 
vides a quantitative summary of the performanc e of the 
method. For the HP model, first column of Table III, the 



method was iterated until it could not find further solu- 
tions with (estimated) folding temperature greater that 
0.1 . For the cases of 6 and 20 classes, a very large num- 
ber of solutions exist. Hence, we stopped the procedure 
after 1000 or 500 iterations, depending on the number of 
classes. 

An appealing feature is that the extracted solutions 
show no bias for sequence composition (see Fig. Hd) or 
ground-state energy. This can be seen in Fig. y, where 
we have plotted the energies of 1000 designed solutions of 
fixed composition for the 6-classes case. Solutions do not 
exhibit packing around the minimum energy (« —830) 
and their energy spread is fairly wide (the estimated 
maximum energy is ~ —170). Furthermore, for each ex- 
tracted sequence we also calculated its folding tempera- 
ture, to compare it with 1/13. As we remarked, if all the 
significant competitors of Fj were included in D, then se- 
quences satisfying (||) should have folding temperatures 
greater than 1/(3. As shown in the typical plot of Fig. 
y this is almost always the case, ensuring that solutions 
can be extracted with a desired thermal stability. An al- 
ternative measure of the thermal stability connected to 
the cooperativity and rapidity of the folding process is 
iuence, s, designing structure F, the 



(7) 



where {Hs) is the average energy over the maximally 
compact conformations and Ug the standard deviation of 
the energy in this ensemble. Fig. H shows a scatter plot 
of extracted solutions for target structure Fi for the 20- 
letter case. It can be seen that there exist solutions with 
very high Z score throughout the displayed energy range. 
This proves the usefulness of the novel design technique 
which has no bias in native-state energy. In fact, it al- 
lows to collect equally good folders with a wide range of 
native-state energy (and hence very different sequences). 
This ought to be useful in realistic design contexts, where 
among all putative design solutions one may wish to re- 
tain those with specific amino acids in key protein sites. 
The ability to select sequences across the whole energy 
range highlights the efficiency of the technique. In fact, 
as shown in Fig. ^, away from the lowest energy edge, 
the fraction of good sequences over the total ones with 
the same energy is minuscule (note the logarithmic scale). 
Our method is able to span across the whole energy range 
without restricting to those of minimal energy, which are 
a negligible fraction of the total solutions. 

Finally, we analysed the degree of mutual similarity 
between extracted solutions. For the 6-classes case, the 
sequence similarity between solutions was rather low, be- 
ing around 20%, as can be seen in Fig. 0. This rules out 
the possibility that solutions correspond to few point mu- 
tations of a single prototype sequence. 

One of the most significant features of the novel design 
procedure is that the number of structures, D, used to 
calculate the approximate free energy, (g), can be kept 
to a negligible fraction of the total structures and yet 
allow a very efficient design. This is proved even more 
strikingly by a further test of our design strategy in the 
whole space of both compact and non-compact confor- 
mations. We carried out a design of structure F2 by 
using the HP parameters with the constant a set to 0. 
This amounts to allow for non-compact conformations to 
be native states. Since it is unfeasible to explore this 
enlarged structure space, step 2 was carried out with a 
stochastic Monte Carlo process, as described in refs. [ 
|3|, 1^, which generated dynamically growing low-energy 
conformations at a suitable fictitious Monte Carlo tem- 
perature. The correctness of the putative solutions was 
carried out by using an algorithm knownj-aAjConstrained 
Hydrophobic Core Construction (CHCC)BM The algo- 
rithm relies on an efficient pruning of the complete search 
tree in finding possible low energy conformations for a se- 
quence. At the heart of the algorithm is the observation 
that the most energetically convenient conformations for 
the hydrophobic monomers is to form a compact, cubic- 
like, core. This ideal situation may not be reachable for 
arbitrary sequences, due to frustration effects; these are 
taken systematically into account to build a compact core 
with a number of cavities sufficient to expose P singlets 
(i.e. a P flanked by two H monomers in the sequence) 



on the surface, which is energetically more effective than 
burying them in the core. Then, exhaustive search algo- 
rithms are used to check the compatibility of a sequence 
with cores of increasing surface area (i.e. decreasing en- 
ergy) . A detailed description of the method can be found 
in Refs. [ H, ||. The time required by CHCC to find 
the ground state energy of a sequence increases signifi- 
cantly, on average, with the increase of the number of H 
residues. For this reason we limited the search for de- 
sign solutions to sequences with ur = 13. The solutions, 
obtained in about one hundred iterations, appears in ta- 
ble (IV). All the 23 extracted solutions had r2 as the 
unique ground state among the compact structures, and 
17 of them retained r2 as ground state even when non- 
compact structures are considered. Given the vastity of 
the enlarged structure space this represent a remarkable 
result. 



V. CONCLUSIONS 

We have presented a novel approach to protein design 
that encompasses negative design features. Taking the 
latter into proper account has been shown to be crucial 
for a succesful protein design. From a practical point of 
view this amounts to calculating the conformational free 
energy of all sequences which are candidate solutions. 
This computational intensive task is kept to a minimum 
in our scheme thanks to the identification of a limited 
number of structures which are close competitors of the 
target conformations. The strategy, is easy to implement 
and has been tested on minimalist models. The method 
appears to be very efficient and reliable for a variety of 
different sets of amino acid interactions. Contrary to 
other design techniques, the extracted solutions show no 
bias in sequence composition or native state energy and 
can be chosen to have a desired thermal stability. 

We thank J. Banavar, G. Morra, F. Seno and G. Set- 
tanni for discussions and suggestions. We acknowledge 
support from the Istituto Nazionale di Fisica della Ma- 
teria. 
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relative structures 


Enc. 


Ti 


URDDLLFFRBRFULLBBUFFRRBDLU 


25 


Ta 


UURFLFDBRBDFLFRRBBUUFFLDRB 


337 


Ts 


UURRFDLULDDFUURRDDBBULDFFU 


1224 


r4 


UURRDLFFRBULLDDRBRFFLLUURR 


1303 



Correct solutions 



TABLE I. The four structures used for benchmarking the 
design strategy. The conformations are encoded in bond di- 
rections: U, up; D, down; L, left; R, right; F, forward; B, 
backward. The encodability in the rightmost column is de- 
fined as the number of sequences admitting the corresponding 
structure as their unique native state (HP interactions are as- 
sumed). 
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-20.49 


-38.20 


-6.65 


-43.65 


-10.63 
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-20.49 


-14.91 


-18.13 


-4.00 


-15.56 


-3.81 
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-38.20 


-18.13 


-35.75 


-5.07 


-23.96 


-26.02 
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-6.65 


-4.00 


-5.07 


-1.65 


-5.17 


-9.47 
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-43.65 


-15.56 


-23.96 


-5.17 


-43.71 


-18.63 


6 


-10.63 


-3.81 


-26.02 


-9.47 


-18.63 


-26.70 



TABLE IL Energy parameters for t 
rameters obey the segregation principl 



6-class model. Pa- 





HP 


6 classes 


20 classes 
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Nsol 


Nu 


Nsol 


Li 


62 
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1000 


895 


500 


388 


Fa 


722 


337 


1000 


891 


500 


419 


Ls 


1898 


1219 


1000 


906 


500 


423 


r4 


1719 


1297 


1000 


911 


500 


457 



TABLE HL Number of extracted solutions, Naoi, after Nu 
iterations of the design procedure. For the HP model Nu 
is the number of iterations at which the iterative scheme 
stopped. It was verified that the 1297 extracted solutions 
for structure r4 have a folding temperature between 0.15 and 
0.6. 



010111001110110001010100001 
000011011100111101000100101 
000010011100111101000101101 
000010000111100101110101101 
000010010100101111000110111 
000010000110100111010110111 
000010110100100111000101111 
000010000110100111010110111 
000010110100100111000101111 
000010010100101111000101111 
000010000110100111010101111 
000010110100100101000111111 
000010010100101101000111111 
000010100100100111000111111 
000010010100100111000111111 
000010000110100101010111111 
000010000100100111010111111 



Incorrect solutions 



110010001110110001010101001 
010011001110110100010100101 
110010001100110101000101101 
100011001100110101000101101 
000010100100100111010101111 



TABLE IV. Extracted solutions for structure Fa. The de- 
sign attempt was carried out in the whole space of conforma- 
tions with arbitrary degree of compactness. 




FIG. 1. The target structures Fi (top left), r2 (top right), F3 (bottom left), F4 (bottom right). 
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FIG. 2. Number of extracted solutions versus the number of iterations for HP interactions (top left), 6 amino acid classes 
(top right) and 20 classes (left bottom). The ideal curve, corresponding to efficiency 1, should have slope 1. Plots referring to 
structures Pi, P2, Pa, P4 are denoted with continuous, dotted, dashed and long-dashed lines respectively. Bottom right panel 
represents the histogram of the number of extracted solutions a a function of sequence composition (HP model). Curves pertain 
to an HP-design attempt on structure P4 at different values of Nu : 200, 400, 800, 1719. It can be seen that the efficiency of the 
design technique is independent of the sequence composition. 
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FIG. 3. Energy of the solutions found for structure r4 
(6-class model) at fixed composition (4, 4, 4, 5, 5, 5) 
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FIG. 5. Scatter plot of the Zscore against native-state en- 
ergy of extracted solutions designing structure Fi . The data 
are for a 20-letter alphabet of amino acids at fixed and nearly 
uniform composition. 
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FIG. 4. Folding temperatures of solutions designing struc- 
ture r4 (6-class model) as a function of the order of extraction. 
Very few solutions turn out to have a folding temperature be- 
low the simulation temperature T — 10 (shown with a dotted 
line) . 
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FIG. 6. Solid line: distribution (in arbitrary units) of so- 
lutions (good sequences) to the design problem on structure 
Fi (20 letter alphabet). The dotted line denotes the distri- 
bution containing bad sequences. The data was obtained by 
randomly sampling lO'^ sequences with fixed composition. 
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FIG. 7. Histogram of the average overlap (sequence iden- 
tity) of solutions for r4 (6-class model). For a given sequence 
the average overlap is calculated over all other extracted so- 
lutions. 



